-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-2028] Implement RockDbBasedMap as an alternate to DiskBasedMap in ExternalSpillableMap #3117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3117 +/- ##
============================================
- Coverage 46.01% 44.29% -1.73%
+ Complexity 5306 4615 -691
============================================
Files 911 826 -85
Lines 39476 36669 -2807
Branches 4254 3949 -305
============================================
- Hits 18166 16243 -1923
+ Misses 19456 18674 -782
+ Partials 1854 1752 -102
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
hudi-common/src/main/java/org/apache/hudi/common/util/collection/RocksDBDAO.java
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
Outdated
Show resolved
Hide resolved
…ethod have wrong filter condition (apache#3109)
…culative execution is enabled (apache#3093) * unit tests added
…thsOfInstant (apache#3125) Co-authored-by: 喻兆靖 <[email protected]>
…Table is enabled. (apache#3079)
…te instants from the dataset timeline. (apache#3082)
…eClient$commitstats (apache#3050)
…tionPlanOperator (apache#3105) Co-authored-by: 喻兆靖 <[email protected]>
…apache#3111) [HUDI-2069] KafkaAvroSchemaDeserializer should get sourceSchema passed instead using Reflection
…tadata Safely (apache#3138) Co-authored-by: 喻兆靖 <[email protected]>
…can not assign initially (apache#3148)
…HMS in different regions (apache#2542) [global-hive-sync-tool] Add a global hive sync tool to sync hudi table across clusters. Add a way to rollback the replicated time stamp if we fail to sync or if we partly sync Co-authored-by: Jagmeet Bali <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see we have more test coverage in TestExternalSpillableMap. Can we parametrize the tests that are applicable and run those for both type of spillable maps.
In general, do we have tests around diff values for maxInMemorySizeInBytes for external spillable map. If not, do you think we can add them while we are at this.
hudi-common/src/main/java/org/apache/hudi/common/util/collection/SpillableDiskMap.java
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/util/collection/ExternalSpillableMap.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/util/collection/SpillableDiskMap.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/util/collection/SpillableRocksDBBasedMap.java
Outdated
Show resolved
Hide resolved
|
|
||
| @Override | ||
| public Stream<R> valueStream() { | ||
| return keySet.stream().sorted().sequential().map(valueMetaData -> (R) get(valueMetaData)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am not very conversant w/ this spillable map in general.
May I know is there a necessity to sort here? I don't see external spillable map in master does any sort for valueStream. also, for putAll() I don't see any validation as such for sorted entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I made it consistent with DiskBasedMap, but in disk based map we do it to sort the offsets for ensuring sequential reads. But for rocksDb, we do not need to do it and leave it to RocksDb to do it.
hudi-common/src/main/java/org/apache/hudi/common/util/collection/SpillableRocksDBBasedMap.java
Show resolved
Hide resolved
| while (itr.hasNext()) { | ||
| HoodieRecord<? extends HoodieRecordPayload> rec = itr.next(); | ||
| oRecords.add(rec); | ||
| assert recordKeys.contains(rec.getRecordKey()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor. why are we not testing for value equality. is there any particular reason ?
...ommon/src/test/java/org/apache/hudi/common/util/collection/TestSpillableRocksDBBasedMap.java
Show resolved
Hide resolved
...ommon/src/test/java/org/apache/hudi/common/util/collection/TestSpillableRocksDBBasedMap.java
Outdated
Show resolved
Hide resolved
...ommon/src/test/java/org/apache/hudi/common/util/collection/TestSpillableRocksDBBasedMap.java
Show resolved
Hide resolved
…putFormat#MergeIterator to avoid StackOverflow (apache#3159)
…he#3168) Co-authored-by: 喻兆靖 <[email protected]>
…et io for flink batch compaction (apache#3169)
I had tests for TestExternalSpillableMap in the stacked diff, anyway added it here.
We do test if the right amount of keys were in memory and the right amount was spilled over. Wondering how different values of maxInMemorySizeInBytes will help? |
Co-authored-by: 喻兆靖 <[email protected]>
* [HUDI-1944] Support Hudi to read from committed offset * [HUDI-1944] Adding group option to KafkaResetOffsetStrategies * [HUDI-1944] Update Exception msg
Co-authored-by: 喻兆靖 <[email protected]>
|
Moved to #3194 |
|
Moved to #3194 |
What is the purpose of the pull request
This pull request adds a new alternative based on RockDb for the Disk Based Map that is used within the ExternalSpillableMap. Our benchmark results shows that RockDb may improve performance significantly when the data set is large while available memory may be scarce. RockDb supports compression, efficient memory usage and native library, that may be more efficient in certain situations. By default, disk based map will be used, and a config change will be required to enable rocksDb.
In this PR, the rocksDB support is only enabled for HoodieMergeHandle, and a subsequent PR will extend it to all consumers of ExternalSpillableMap (tracked here HUDI-2044)
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Added the unit test in TestSpillableRocksDBBasedMap
Updated the test for TestExternalSpillableMap